Introduction

Water is vital to human and animal health. Water quality has the most direct impact on aquatic lives. Water properties like pH, temperature, dissolved oxygen(DO), turbidity, hardness, pathogen, and suspended sediments are important in determining water quality. Microbiological indicators of water quality include bacteria such as fecal coliforms, Escherichia coli. Escherichia coli(E.coli) is a coliform bacterium living in the human intestines. Although most E.coli strains are harmless, some can cause bloody diarrhea, vomiting, stomach pains and other severe diseases. One strain(Shiga toxin-producing E.coli)even leads to kidney failure[1]. Drinking contaminated water is the most common way to get infected. In BC Approved Water Quality Guidelines, the maximum allowable concentration of drinking water sources is less than 10 CFU in 100 mL[2]. Turbidity is the measure of the cloudiness and relative clarity of water. It is an indicator of source water quality and filtration effectiveness, and an important parameter when designing and developing pollution prevention and treatment strategies. In source water bodies, mud, algae, plant pieces and chemicals in water can cause high turbidity, which significantly reduces the aesthetic quality of lakes and streams, and also increases the cost of treatment. The maximum allowable turbidity for raw drinking water without treatment for particulates is 1 NTU in the guidelines. As turbidity should be reduced as low as possible, source water with lower turbidity can lighten the burden on the water treatment system. Weather has influences on water quality through precipitation and temperature. Increased temperature would cause algae and bacteria growth, which will reduce water quality. Low temperature has a significant effect on log reduction of E.coli growth[3]. Intensive rainfall is associated with sediments rushed off from land to water, increasing the turbidity of the water[4].

Monitoring water quality is used to determine whether the water meets a range of water quality goals and understand current quality conditions. Water quality provides the necessary information to make decisions about regulating water quality and developing the water treatment process. In BC, there are currently 42 water quality monitoring stations that measure water quality by collecting water samples for lab analysis or sensoring physical and chemical parameters. The data is from one of the stations –North Alouette River at 132nd Ave Bridge near Maple Ridge. The water was sampled from the left bank upstream. The data records temperature, mean air temperature, E.coli level, turbidity, precipitation of the day, in the last 3 days and in the last 7 days. The water is sampled every two weeks from 2004 to 2020.

In this study, we hypothesize that water quality is related to precipitation and temperature, and we use generalized linear regression to test this hypothesis. We examined the following questions: (i) Does precipitation have an effect on turbidity and E.coli level? If so, which data(i.e.precipitation of today, accumulated rainfall in past three days, accumulated rainfall in the past week) has the greatest influence? (ii) The correlation between the water quality factors. (iii) How well does a linear regression model, generalized linear regression model, non-linear model fit the data set? (iv) Are the residuals temporal autocorrelated? We will conduct a series of statistical analyses to answer these questions. ## Method

Data collection and variables

The North Alouette River water quality data was collected by BC water quality monitoring station from 2004/03/03 to 2020/01/30 at 132nd Ave Bridge near Maple Ridge and sampled from the left bank upstream. It was collected to monitor turbidity and E.coli concentration. The measures include: * Sample time * Fecal Coliforms * E.Coli * Temperature * Turbidity * Precipitation today * Precipitation in last 3 days * Precipitation in last 7 days * Mean air temperature The water was sampled approximately every two weeks. Temperature is measured by submerging the thermometer two-thirds below the surface of water at a central flowing location. Mean air temperature is the average temperature of the sample day. Turbidity is an optical measurement that indicates the presence of suspended particles. It is commonly measured in Nephelometric Turbidity Units (NTU). Turbidity is measured by turbidity meters. E.coli is measured by mixing the water sample and fluorescence substance powder and checking fluorescence under UV light.

Data cleaning

Data cleaning, which deals with identifying and removing erroneous, incorrect, duplicate, inconsistent or incomplete data to improve the data quality[5], is the first process step in every data analysis. In our water quality data, we remove the incomplete rows by removing the rows contained NA. Original data has 368 entries. After data cleaning, there are 270 rows left.

Data exploration

When we access the data, the first step is to visualize the data and investigate the data outliers, normality, linear and non-linear relationship between variables, ask questions such as where are the data centered, how are they spread, and whether data transformation is required[6].

Box plot and outliers

Box plot, or box-and-whisker plot, is a standard way of visualizing the mean and distribution of a dataset based on five features: the minimum, the first quartile(Q1), the sample median, the third quartile(Q3), and the maximum. From the box plot, we can detect the outliers. Most statistical parameters such as mean, standard deviation, and correlation are highly sensitive to outliers. Any statistical calculation based on these parameters is affected by the presence of outliers. By checking the box plots of turbidity and E.coli, there are some dangerous outliers that need to be removed.

Scatter plot

Scatter plot is a visualizing tool which shows the relationship between two variables by plotting the predictor variable along the X axis and the response along the Y axis. To help visualize the relationship, a regression line is often added to the plot.

Pairplot

Pair plot visualizes the pairwise relationships in a dataset. It shows multiple pairwise scatter plots in one graph and can be used to analyze relationships between variables, and between variables and responses to detect collinearity. It shows the relationship for a combination of variables and the diagonal plots are the univariate plots. In the water quality data, we used the R base function pairs().

Correlation

Correlation, or dependence, refers to the degree to which a pair of variables are linearly related. A positive correlation indicates that variables increase or decrease together, while a negative correlation indicates that with one variable increasing, the other decreases, and vice versa. When predictor variables are highly correlated, they cannot independently predict the response, resulting in collinearity. However, the thresholds of correlation coefficients |r|<0.7 is an indicator. If |r| <0.7, collinearity will not be a big concern in regression analysis[8]. We use R package corrplot to visualize the correlation and RColorBrewer package to manage colors.

Multiple linear regression

From previous literature, there are typically multiple variables influencing turbidity and E.coli. Our hypothesis is that turbidity is proportional to accumulated precipitation in the last three days and temperature, because runoffs carry sediment and pollutants into the river, increasing turbidity; Higher temperature leads to more rapid Brownian motion. Based on the assumptions, we can describe the relationship by an intercept(\(\beta _{0}\)) and temperature coefficient \(\beta_{1}\), precipitation coefficient \(\beta_{2}\). We assume turbidity is normally distributed. A stochastic model can be built: \[y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i}+\epsilon_{i}\] For E.coli level, we assume that temperature and precipitation both play a role in E.coli growth. We can describe the relationship by intercept \(\beta _{3}\) and temperature coefficient \(\beta_{4}\), precipitation coefficient \(\beta_{5}\). \[z_{i} = \beta_{3} + \beta_{4}x_{1i} + \beta_{5}x_{2i}+\epsilon_{i}\] In our hypothesis, E.coli level is also normally distributed. We installed the nlme package, and used the function gls() and REML method fitting linear model. In complex systems such as river ecosystems, variables often interact with each other to influence outcomes. The water quality response varies due to the interaction of changes in both temperature and precipitation. We added an interaction term of precipitation and temperature to both models: Turbidity: \[y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \beta_{12}x_{1i}x_{2i}+\epsilon_{i}\] E.coli: \[z_{i} = \beta_{3} + \beta_{4}x_{1i} + \beta_{5}x_{2i} + \beta_{45}x_{4i}x_{5i}+\epsilon_{i}\] \(\beta_{12}\) and \(\beta_{45}\) are the precipitation and temperature interaction terms.

Model selection and model averaging

The goal of modeling is to determine the best possible model that captures all of the important features of data. Adding parameters may improve the performance and lead to a better fit outcome, but also increase the risk of overfitting. There are a few model evaluation statistics including Akaike Information Criterion(\(AIC\)), and likelihood-ratio test(\(LRT\)).

Likelihood Ratio Test

The likelihood-ratio test compares a pair of nested models based on the ratio of their likelihoods[8]. \[\lambda_{LR}=-2(\ln[\mathcal{L}(\textrm{Reduced})-\mathcal{L}(\textrm{Full})])\]

Akaike Information Criterion(AIC)

Information criteria(IC) approach is based on Kullback-Leibler information. It is used to select the best models that minimizes the K-L information loss. Akaike Information Criterion(AIC) is a famous estimation of K-L information loss. A lower value of AIC suggests that the model explains the data better[9]. AIC is represented by the following equation: \[AIC= -2\ln(\mathcal{L}) + 2\mathcal{K}\] \(\mathcal{L}\) is the maximum likelihood estimate of parameters; \(\mathcal{K}\) is the number of parameters in the model. A small sample size bias correction for AIC is AICc, given by: \[AICc=AIC+\frac{2k^2+2k}{n-k-1}\] #### Cross-validation Estimate of Accuracy Cross-validation is a resampling procedure used to evaluate regression models on a limited data sample. The data are randomly divided into k folds, and remove each fold while use the remaining data to re-fit the regression model. The accuracy of the model is the average of the accuracy of each fold[10].

Model averaging

Model averaging refers to the process of using several models at once for making predictions or inferring parameters. It is a useful tool to deal with model uncertainty. We will get the full benefits of model averaging when the models are of high variance. We subset and averaged the plausible models(only \(\Delta AICc\) within 2). We used the subset() function and confint() to estimate parameters.

Model Validation

We displayed a diagnostic check including residuals plot, normality, correlation, and heterogeneity check for the selected model to ensure the linear regression is trustworthy. We plotted the residuals versus fitted values to detect if there is any pattern of relationship and heteroscedasticity of the data. We used histogram and Q-Q plot to check the normality of residuals. Function qqPlot() is from the car package.

Heteroscedasticity

Heteroscedasticity refers to unequal scatter, more generally, if the covariance matrix has a non-constant diagonal. Heteroscedasticity is a major concern in regression analysis, as it invalidates the independent and identically distributed(i.i.d) assumption[11]. It is a problem because in ordinary least squares regression(OLS), it is assumed that all residuals have the same variance. With heteroscedasticity, standard errors will be poorly estimated resulting in misleading p-values. In order to detect heteroscedasticity, we need to identify the cause by plotting the residuals against each of the predictors. Then to deal with heteroscedasticity, we apply variance structure to the regression model we have built.

Fixed variance structure

\[\epsilon_{i}\sim\mathcal{N}(0,\sigma_{i}\times x_{i})\] Fixed variance structure allows for variance to increase with \(x\).

Power variance

\[\epsilon_{i}\sim\mathcal{N}(0,\sigma_{i}\times\lvert x_{i} \rvert^{2\delta_{j}})\] The power variance structure is a generalization of the fixed variance structure. If \(\delta > \frac{1}{2}\)the variance is allowed to spread more than a fixed variance for the same increase in \(x\).

Exponential variance

\[\epsilon_{i}\sim\mathcal{N}(0,\sigma_{i}\times e^{{2\delta_{j}}\times x_{i} })\] The exponential variance structure models the variance of the residuals as \(\sigma_{i}\) multiplied by an exponential function. If \(\delta_{j}> 0\) the variance increases with x. If \(\delta_{j}< 0\) the variance decreases with x.

Temporal autocorrelation

Temporal autocorrelation refers to the correlation between two different time series. The autocorrelation is a mathematical representation of the degree of similarity between a given time data point and a lagged version of itself over successive time intervals. Data that are close together in time are more related. With autocorrelation, each data point is related to a previous data point, thus the data point does not bring a full independent worth of information. In our study, temperature and precipitation can be temporally autocorrelated, because in the same season, the temperature and precipitation can be similar. We typically identify autocorrelation by plotting autocorrelation functions (ACFs) which are calculated by following equations.

\[var_{t} = \frac{1}{n}\sum(resid_{t}\times resid_{t})\] \[var_{t+i-1}=\frac{1}{n}\sum(resid_{t}\times resid_{t+i-1})\] \[ACF=var_{t+i-1}/ var_{t}\] ACF is typically used as a visual diagnostic tool, ranging from 1 to -1 and autocorrelation at log0=1. We used acf() to detect temporal autocorrelation. ### Binary data of turbidity In Guidelines for Canadian drinking water quality, it is recommended that water entering the distribution system has turbidity level of 1.0 NTU or less. Treatment system should be designed and operated to reduce the turbidity as low as possible[12]. In this case, we set the turbidity standard to 0.5NTU and binary coded the data. Water sample with turbidity greater than 0.5 is marked as 0(failure), less than 0.5 is marked as 1(success). ### Generalized linear model

Generalized linear model is a flexible generalization of ordinary linear regression(OLS), which allows regression to have distributions other than a normal distribution. The linear model is related to the response variable via a link function. Generalized linear models were formulated as a way of unifying various other statistical models, including binomial distribution, Poisson distribution and negative binomial distribution[13].

Turbidty

The turbidity data is binary coded with 0(failure)-1(success). In this case, we applied logistic regression. As turbidity is 0-1 data, a good candidate is the binomial distribution. Binomial distribution describes the probability of obtaining k yes/no successes in a sample of size n. It is the distribution of the number of successful trials among a defined number of trials. Binomial distribution: \[Y_{i}\sim Binomial(1,\pi_{i})\] Expectation: \[E(Y_{i})=\pi_{i}\] Variance: \[Var(Y_{i})=\pi_{i}\times (1-\pi_{i})\] Link function: \[\pi_{i}=\frac{e^{\beta_{0}+\beta_{1}X+...}}{1+e^{\beta_{0}+\beta_{1}X+...}}\] #### E.Coli As we are working on counts of E.coli colonies, which is a discrete distribution with support between 0 and \(\infty\). The Poisson distribution and negative binomial distribution are good candidates for modeling E.coli data. For the Poisson distribution: \[Y_{i} = \frac{{e^{ - \lambda_{i}} \lambda_{i} ^x }}{{x!}}\] Link function: \[\ln(\lambda_{i})=\beta_{0}+\beta_{1}X+...\] Poisson distribution has only one parameter which stands for both mean and standard deviation. It can also be a constraint of Poisson distribution model. The variance is tied to the mean and therefore less flexible. Negative binomial distribution is often a viable alternative to the Poisson distribution as it allows for more heterogeneity because variance \(\neq\) mean. We used glm.nb from library MASS to model the negative binomial distribution. ### Non-linear model As biological relationship is not always linear, we can build non-linear model to fit the E.coli data. From previous studies, the optimum temperature for E.coli to grow should be around \(37^{\circ}C\)[14]. Temperatures below or above the optimum temperature are not suitable for E.coli growth. We can use the quadratic function, Ricker function and Holling type III to fit the data. In R, we used nonlinear least squares nls() function. Quadratic: \[f(x)=\beta_{0}+\beta_{1}x+\beta_{2}x^2\] Ricker function: \[f(x)=axe^{-bx}\] Holling type III: \[f(x)=\frac{ax^2}{b^2+x^2}\] ## Result

Import the data

Data Visualization

####Box plot and outliers

#Original box plot
boxplot(wq$Turbidity, ylab="Turbidity(NTU)")

boxplot(wq$E.Coli, ylab="E.Coli(CFU)")

#25% and 75% quantiles
Q1 <- quantile(wq$Turbidity, probs=c(.25, .75))
Q2 <- quantile(wq$E.Coli, probs=c(.25, .75))
iqr1 <- IQR(wq$Turbidity)
iqr2 <- IQR(wq$E.Coli)
#Exclude outliers
eliminated1<- subset(wq, wq$Turbidity > (Q1[1] - 1.5*iqr1) & wq$Turbidity < (Q1[2]+1.5*iqr1))
eliminated2<- subset(eliminated1, eliminated1$E.Coli > (Q2[1] - 1.5*iqr2) & eliminated1$E.Coli < (Q2[2]+1.5*iqr2))
wq_rmout<-eliminated2
#box plots of cleaned data
boxplot(wq_rmout$Turbidity, ylab="Turbidity(NTU)")

boxplot(wq_rmout$E.Coli, ylab="E.Coli(CFU)")

We can see that outliers in both E.coli and turbidity are significantly different from the other data points. They increased the variability in the data and decreased statistical power. Excluding outliers is necessary in this case. After removing the outliers, we find that the median of turbidity is 0.3 NTU, and the median of E.coli is around 3 CFU/ml.

#Scatter plot
#Turbidity~Precipitation
par(mfrow=c(3,3))
plot(wq_rmout$Precipitation.Today, wq_rmout$Turbidity, xlab="Precipitation Today(mL)", ylab="Turbidity(NTU)")
abline(lm(wq_rmout$Turbidity~wq_rmout$Precipitation.Today),col="blue")
plot(wq_rmout$Precipitation.Last.3.Days, wq_rmout$Turbidity, xlab="Precipitation in Last 3 Days(mL)", ylab="Turbidity(NTU)")
abline(lm(wq_rmout$Turbidity~wq_rmout$Precipitation.Last.3.Days),col="blue")
plot(wq_rmout$Precipitation.Last.7.Days, wq_rmout$Turbidity, xlab="Precipitation in Last 7 Days(mL)", ylab="Turbidity")
abline(lm(wq_rmout$Turbidity~wq_rmout$Precipitation.Last.7.Days),col="blue")
#E.coli~Precipitation
plot(wq_rmout$Precipitation.Today, wq_rmout$E.Coli, xlab="Precipitation Todays(mL)",ylab="E.Coli(CFU/mL)" )
abline(lm(wq_rmout$E.Coli~wq_rmout$Precipitation.Today),col="blue")
plot(wq_rmout$Precipitation.Last.3.Days, wq_rmout$E.Coli, xlab="Precipitation Last 3 Days(mL)",ylab="E.Coli(CFU/mL)" )
abline(lm(wq_rmout$E.Coli~wq_rmout$Precipitation.Last.3.Days),col="blue")
plot(wq_rmout$Precipitation.Last.7.Days, wq_rmout$E.Coli, xlab="Precipitation Last 7 Days(mL)",ylab="E.Coli(CFU/mL)" )
abline(lm(wq_rmout$E.Coli~wq_rmout$Precipitation.Last.7.Days),col="blue")
#E.coli~Temperature
plot(wq_rmout$Temp, wq_rmout$E.Coli, xlab="Temperature(°C)", ylab="E.Coli(CFU/mL)")
abline(lm(wq_rmout$E.Coli~wq_rmout$Temp),col="blue")

\ We plotted turbidity against precipitation today, precipitation the last 3 days and precipitation the last 7 days. Scatter plots suggest that turbidity is proportional to precipitation. From the scatter plots of turbidity against precipitation in the last 3 days, there is a significant linear relationship. As temperature rises, the amount of E.coli colonies increase. The relationship between E.coli and precipitation in the last 7 days is the strongest among precipitation variables.

#Pairplot
df<-subset(wq_rmout, select = -c(Sample.time) )
pairs(df)

#Correlation
cor(df)
##                           Coliforms.Fecal      E.Coli       Temp   Turbidity
## Coliforms.Fecal                1.00000000  0.91633534  0.5300363 -0.06034534
## E.Coli                         0.91633534  1.00000000  0.5770368 -0.16045591
## Temp                           0.53003633  0.57703684  1.0000000 -0.18988139
## Turbidity                     -0.06034534 -0.16045591 -0.1898814  1.00000000
## Precipitation.Today            0.02146397 -0.02796446 -0.1180382  0.35245191
## Precipitation.Last.3.Days     -0.12044188 -0.14873152 -0.1917028  0.43484599
## Precipitation.Last.7.Days     -0.17281128 -0.19283391 -0.2627462  0.34863101
## Mean.Air.Temp                  0.43910914  0.46398508  0.8364151 -0.15993450
##                           Precipitation.Today Precipitation.Last.3.Days
## Coliforms.Fecal                    0.02146397                -0.1204419
## E.Coli                            -0.02796446                -0.1487315
## Temp                              -0.11803816                -0.1917028
## Turbidity                          0.35245191                 0.4348460
## Precipitation.Today                1.00000000                 0.2751079
## Precipitation.Last.3.Days          0.27510788                 1.0000000
## Precipitation.Last.7.Days          0.22855777                 0.6884246
## Mean.Air.Temp                     -0.09707279                -0.1785742
##                           Precipitation.Last.7.Days Mean.Air.Temp
## Coliforms.Fecal                          -0.1728113    0.43910914
## E.Coli                                   -0.1928339    0.46398508
## Temp                                     -0.2627462    0.83641510
## Turbidity                                 0.3486310   -0.15993450
## Precipitation.Today                       0.2285578   -0.09707279
## Precipitation.Last.3.Days                 0.6884246   -0.17857416
## Precipitation.Last.7.Days                 1.0000000   -0.25144136
## Mean.Air.Temp                            -0.2514414    1.00000000
#Corrplot
corr<-cor(df,use="complete.obs", method="kendall")
library(corrplot)
## corrplot 0.84 loaded
library(RColorBrewer)
M <-cor(df,use="complete.obs", method="kendall")
corrplot(M, type="upper", order="hclust",col=brewer.pal(n=8, name="RdYlBu"))

From pair plots and correlation matrix, there are weak linear relationships between variables except for Temperature and Mean Air Temperature, E.coli and Fecal Coliforms. We should not include both of them in a model, due to collinearity. \(R^2\) of other relationships are both less than 0.7. Apart from adding model complexity, there is no concern of collinearity. From the correlation matrix, we find that the relationship between turbidity and precipitation in the last 3 days is the most significant(0.43), and the relationship between E.coli and precipitation in the last 7 days is the most significant(-0.19). We would use precipitation in the last 3 days and precipitation in the last 7 days separately when building the model of turbidity and E.coli. ### Multiple Linear Regression First, we build the regression model of turbidity with variables of temperature and precipitation in the last 3 days.

#Turbidity~temperature&precipitation
library(nlme)
mod1<-gls(Turbidity~Precipitation.Last.3.Days + Temp, data=wq_rmout)
summary(mod1)
## Generalized least squares fit by REML
##   Model: Turbidity ~ Precipitation.Last.3.Days + Temp 
##   Data: wq_rmout 
##       AIC       BIC  logLik
##   -223.81 -210.3461 115.905
## 
## Coefficients:
##                                 Value   Std.Error   t-value p-value
## (Intercept)                0.29485589 0.020444588 14.422198  0.0000
## Precipitation.Last.3.Days  0.00334633 0.000503699  6.643507  0.0000
## Temp                      -0.00364111 0.002050091 -1.776073  0.0771
## 
##  Correlation: 
##                           (Intr) P.L.3.
## Precipitation.Last.3.Days -0.505       
## Temp                      -0.825  0.192
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.6490704 -0.6881675 -0.2525559  0.5028867  3.1858990 
## 
## Residual standard error: 0.1328214 
## Degrees of freedom: 217 total; 214 residual
#with interaction
mod2<-gls(Turbidity~Precipitation.Last.3.Days + Temp +Precipitation.Last.3.Days*Temp, data=wq_rmout)
summary(mod2)
## Generalized least squares fit by REML
##   Model: Turbidity ~ Precipitation.Last.3.Days + Temp + Precipitation.Last.3.Days *      Temp 
##   Data: wq_rmout 
##         AIC       BIC   logLik
##   -206.8934 -190.0869 108.4467
## 
## Coefficients:
##                                      Value   Std.Error   t-value p-value
## (Intercept)                     0.30287028 0.022014508 13.757759  0.0000
## Precipitation.Last.3.Days       0.00245608 0.001037004  2.368441  0.0188
## Temp                           -0.00472701 0.002329385 -2.029296  0.0437
## Precipitation.Last.3.Days:Temp  0.00013948 0.000142021  0.982137  0.3271
## 
##  Correlation: 
##                                (Intr) Pr.L.3.D Temp  
## Precipitation.Last.3.Days      -0.552                
## Temp                           -0.850  0.497         
## Precipitation.Last.3.Days:Temp  0.371 -0.874   -0.475
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.5309628 -0.7104594 -0.2191031  0.4888859  3.2543873 
## 
## Residual standard error: 0.1328324 
## Degrees of freedom: 217 total; 213 residual

Then we build the regression model of log transformed E.coli with variables of temperature and precipitation the last 7 days.

#E.Coli~temp&precipitation
wq_rmout$log.ecoli<-log(wq_rmout$E.Coli)
mod3<-gls(log.ecoli~Precipitation.Last.7.Days * Temp , data=wq_rmout)
summary(mod3)
## Generalized least squares fit by REML
##   Model: log.ecoli ~ Precipitation.Last.7.Days * Temp 
##   Data: wq_rmout 
##        AIC     BIC    logLik
##   690.1315 706.938 -340.0658
## 
## Coefficients:
##                                     Value  Std.Error   t-value p-value
## (Intercept)                     0.3719361 0.19735699  1.884585  0.0608
## Precipitation.Last.7.Days      -0.0049522 0.00382465 -1.294800  0.1968
## Temp                            0.1393927 0.02020820  6.897829  0.0000
## Precipitation.Last.7.Days:Temp  0.0007923 0.00053187  1.489564  0.1378
## 
##  Correlation: 
##                                (Intr) Pr.L.7.D Temp  
## Precipitation.Last.7.Days      -0.628                
## Temp                           -0.842  0.574         
## Precipitation.Last.7.Days:Temp  0.384 -0.844   -0.540
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -2.53055323 -0.84149849 -0.02468937  0.74327514  2.66397322 
## 
## Residual standard error: 1.083404 
## Degrees of freedom: 217 total; 213 residual

Model Selection

#Turbidity~precipitation
library(MuMIn)
dredge(mod2)
## Warning in dredge(mod2): comparing models fitted by REML
## Fixed term is "(Intercept)"
## Global model call: gls(model = Turbidity ~ Precipitation.Last.3.Days + Temp + Precipitation.Last.3.Days * 
##     Temp, data = wq_rmout)
## ---
## Model selection table 
##    (Int) Prc.Lst.3.Dys       Tmp Prc.Lst.3.Dys:Tmp df  logLik   AICc delta
## 2 0.2649      0.003518                              3 119.604 -233.1  0.00
## 4 0.2949      0.003346 -0.003641                    4 115.905 -223.6  9.47
## 8 0.3029      0.002456 -0.004727         0.0001395  5 108.447 -206.6 26.49
## 1 0.3163                                            2 103.657 -203.3 29.84
## 3 0.3634               -0.006252                    3 102.423 -198.7 34.36
##   weight
## 2  0.991
## 4  0.009
## 8  0.000
## 1  0.000
## 3  0.000
## Models ranked by AICc(x)
mod4<-gls(Turbidity~Precipitation.Last.3.Days , data=wq_rmout)
summary(mod4)
## Generalized least squares fit by REML
##   Model: Turbidity ~ Precipitation.Last.3.Days 
##   Data: wq_rmout 
##         AIC       BIC   logLik
##   -233.2082 -223.0962 119.6041
## 
## Coefficients:
##                                Value   Std.Error   t-value p-value
## (Intercept)               0.26489963 0.011611959 22.812657       0
## Precipitation.Last.3.Days 0.00351782 0.000496827  7.080577       0
## 
##  Correlation: 
##                           (Intr)
## Precipitation.Last.3.Days -0.625
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.6028858 -0.6812103 -0.2792296  0.4921947  2.9493389 
## 
## Residual standard error: 0.1334853 
## Degrees of freedom: 217 total; 215 residual
plot(wq_rmout$Turbidity~wq_rmout$Precipitation.Last.3.Days)

#E.Coli~temp
dredge(mod3)
## Warning in dredge(mod3): comparing models fitted by REML
## Fixed term is "(Intercept)"
## Global model call: gls(model = log.ecoli ~ Precipitation.Last.7.Days * Temp, data = wq_rmout)
## ---
## Model selection table 
##    (Int) Prc.Lst.7.Dys    Tmp Prc.Lst.7.Dys:Tmp df   logLik  AICc delta weight
## 3 0.2512               0.1560                    3 -329.287 664.7  0.00  0.998
## 4 0.2590    -0.0001431 0.1556                    4 -334.553 677.3 12.61  0.002
## 8 0.3719    -0.0049520 0.1394         0.0007923  5 -340.066 690.4 25.73  0.000
## 1 1.4270                                         2 -363.932 731.9 67.23  0.000
## 2 1.6200    -0.0050730                           3 -366.726 739.6 74.88  0.000
## Models ranked by AICc(x)
mod5<-gls(log.ecoli~Temp, data=wq_rmout)
summary(mod5)
## Generalized least squares fit by REML
##   Model: log.ecoli ~ Temp 
##   Data: wq_rmout 
##        AIC      BIC    logLik
##   664.5733 674.6852 -329.2866
## 
## Coefficients:
##                 Value  Std.Error  t-value p-value
## (Intercept) 0.2512344 0.14401171 1.744541  0.0825
## Temp        0.1559565 0.01642067 9.497571  0.0000
## 
##  Correlation: 
##      (Intr)
## Temp -0.86 
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -2.67765741 -0.80727514 -0.02394595  0.72131158  2.77394036 
## 
## Residual standard error: 1.083968 
## Degrees of freedom: 217 total; 215 residual
plot(wq_rmout$log.ecoli ~ wq_rmout$Temp)

AICc based model selection performed by dredge() function favors Turbidity\(\sim\)Precipitation.Last.3.Days model(AIC=-233.2082) and log.ecoli\(\sim\)Temp model(AIC=664.5733). The coefficients of turbidity model are: Intercept: 0.26489963, Precipitation.Last.3.Days:0.00351782. The coefficients of log.ecoli model are: Intercept:0.2512344. Precipitation.Last.7.Days:0.1559565. ### Model Average We selected the model the \(\Delta\)AICc of 2 of the top model.

#Subset all models within $\Delta$AICc of 2 of the top model
subset1<-subset(dredge(mod2),subset = delta<2)
## Warning in dredge(mod2): comparing models fitted by REML
## Fixed term is "(Intercept)"
subset1
## Global model call: gls(model = Turbidity ~ Precipitation.Last.3.Days + Temp + Precipitation.Last.3.Days * 
##     Temp, data = wq_rmout)
## ---
## Model selection table 
##    (Int) Prc.Lst.3.Dys df  logLik   AICc delta weight
## 2 0.2649      0.003518  3 119.604 -233.1     0      1
## Models ranked by AICc(x)
subset2<-subset(dredge(mod3),subset = delta<2)
## Warning in dredge(mod3): comparing models fitted by REML
## Fixed term is "(Intercept)"
subset2
## Global model call: gls(model = log.ecoli ~ Precipitation.Last.7.Days * Temp, data = wq_rmout)
## ---
## Model selection table 
##    (Int)   Tmp df   logLik  AICc delta weight
## 3 0.2512 0.156  3 -329.287 664.7     0      1
## Models ranked by AICc(x)
#estimate parameters
confint(mod4, level=0.95)
##                                 2.5 %      97.5 %
## (Intercept)               0.242140607 0.287658648
## Precipitation.Last.3.Days 0.002544061 0.004491588
confint(mod5, level=0.95)
##                   2.5 %    97.5 %
## (Intercept) -0.03102339 0.5334921
## Temp         0.12377256 0.1881404

There is only the top model left. 95% confidence intervals of estimate parameters are: Turbidity: Intercept: (0.242140607, 0.287658648), Precipitation.Last.3.Days: (0.002544061, 0.004491588) log.ecoli: Intercept: (-0.03102339 0.5334921), Temperature:(0.12377256, 0.1881404) ### Model Validation The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. We performed diagnostic checks to ensure there are no remaining issues by plotting the residuals in histogram and Q-Q plot.

#Turbidity~precipitation
library(car)
## Loading required package: carData
hist(resid(mod4), xlab="Residuals", ylab="Frequency",main="Histogram of Turbidity Residuals")

qqPlot(resid(mod4))

## 247 100 
## 190  71
#E.Coli~temp&turbidity
hist(resid(mod5), xlab="Residuals", ylab="Frequency", main="Histogram of log.ecoli Residuals")

qqPlot(resid(mod5))

## 155  67 
## 116  49

Both of the distributions are skewed. The turbidity model is positively skewed whereas log.ecoli model is negatively skewed. Then we plot the residuals against predicted values to check if there is obvious patterns of residuals vs fitted value.

#residulas plot
plot(mod4, main="Turbidity")

plot(mod5, main="log.ecoli")

The residuals shows a significant pattern in both plots, which suggests heteroscedasticity in our models.
### Heteroscedasticity

#turbidity:
#exponential variance:
fix1 <- gls(Turbidity~Precipitation.Last.3.Days , weights = varExp (form= ~ Precipitation.Last.3.Days) ,data = wq_rmout)
plot(fit1)
## Error in plot(fit1): object 'fit1' not found
#log.ecoli
#exponential variance:
fix2 <- gls(log.ecoli~Temp, weights = varExp(form=~Temp), data=wq_rmout)
plot(fix2)

Models with fixed variance structure and power structure will cause convergence error in R. After fitting the data with an exponential variance structure, there is no significant improvement. Linear regression models are not appropriate for the data. ### Temporal Autocorrelation

#Turbidity~Precipitation
wq_rmout$date <- as.Date(wq_rmout$Sample.time)
plot(residuals(mod4, type = "normalized") ~ wq_rmout$date,
     col = "blue",
     xlab = "Date",
     ylab = "Residuals",
     pch = 20,
     cex = 0.8)
abline(h=0, col = "grey70", lty = "dashed")

acf(residuals(mod4, type = "normalized"))

#E.Coli~Temp
plot(residuals(mod5, type = "normalized") ~ wq_rmout$date,
     col = "red",
     xlab = "Date",
     ylab = "Residuals",
     pch = 20,
     cex = 0.8)
abline(h=0, col = "grey70", lty = "dashed")

acf(residuals(mod5, type = "normalized"))

ACFs are within the dotted line beyond which the autocorrelations are statistically significantly different from zero. From the ACF plots, temporal autocorrelation is not a concern for both models. ### Binary data of turbidity

acceptable<-rep(1, length(eliminated1$Sample.time))
eliminated1$accept<-acceptable
#subset turbidity >= 0.5 NTU
eliminated1$accept[eliminated1$Turbidity>=0.5] <- 0
plot(accept ~ Precipitation.Last.3.Days, data = eliminated1,
     pch = 16,
     col = "#046C9A",
     xlab = "Precipitation Last 3 Days (mm)",
     ylab = "Turbidity(0-1)")
abline(lm(accept ~ Precipitation.Last.3.Days, data = eliminated1), col = "#FF0000")

We exclude the outliers in turbidity data, and classify turbidity$\(0.5 NTU to label 0, and turbidity\)<$0.5 NTU to label 2. There is a negative relationship between precipitation last 3 days and turbidity, although the plot is hard to interpret. ### Generalized Linear Model #### Turbidity-Logistic Regression Model

#logistic regression model
mod_logistic <- glm(accept~Precipitation.Last.3.Days, family=binomial(link= "logit"), data=eliminated1, na.action = na.fail)
summary(mod_logistic)
## 
## Call:
## glm(formula = accept ~ Precipitation.Last.3.Days, family = binomial(link = "logit"), 
##     data = eliminated1, na.action = na.fail)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2077   0.4278   0.4443   0.5660   1.5434  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                2.345431   0.257564   9.106  < 2e-16 ***
## Precipitation.Last.3.Days -0.039677   0.008584  -4.622 3.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 222.76  on 248  degrees of freedom
## Residual deviance: 200.62  on 247  degrees of freedom
## AIC: 204.62
## 
## Number of Fisher Scoring iterations: 4
#plot
Precipitation.Last.3.Days<-seq(min(eliminated1$Precipitation.Last.3.Days), max(eliminated1$Precipitation.Last.3.Days),0.1)
new_data<-data.frame(Precipitation.Last.3.Days)
pred_logistic<-predict(mod_logistic, newdata = new_data, type = "link", se = TRUE)
Fit_linked_log <- exp(pred_logistic$fit)
Fit_linked_max_log <- exp(pred_logistic$fit + pred_logistic$se.fit*1.96)
Fit_linked_min_log <- exp(pred_logistic$fit - pred_logistic$se.fit*1.96)
plot(eliminated1$Precipitation.Last.3.Days, eliminated1$accept, xlab="Precipitation last 3 days", ylab="Turbidity(0-1)")
lines(Fit_linked_log ~ new_data$Precipitation.Last.3.Days,
       col = "red")
lines(Fit_linked_max_log ~ new_data$Precipitation.Last.3.Days,
       col = "grey70",
      lty = 3)
lines(Fit_linked_min_log ~ new_data$Precipitation.Last.3.Days,
       col = "grey70",
      lty = 3)

#Cross validation of the selected model
library(DAAG)
## Loading required package: lattice
## 
## Attaching package: 'DAAG'
## The following object is masked from 'package:car':
## 
##     vif
CVbinary(mod_logistic)
## 
## Fold:  3 4 1 8 5 7 2 6 10 9
## Internal estimate of accuracy = 0.835
## Cross-validation estimate of accuracy = 0.815
#Evidence ratio of selected model over an intercept only model
dredge(mod_logistic)
## Fixed term is "(Intercept)"
## Global model call: glm(formula = accept ~ Precipitation.Last.3.Days, family = binomial(link = "logit"), 
##     data = eliminated1, na.action = na.fail)
## ---
## Model selection table 
##   (Int) Prc.Lst.3.Dys df   logLik  AICc delta weight
## 2 2.345      -0.03968  2 -100.309 204.7  0.00      1
## 1 1.624                1 -111.381 224.8 20.11      0
## Models ranked by AICc(x)
#delta AICc is 41.88.
1/exp(-(1/2)*41.88)
## [1] 1242013885
#evidence ratio is 1242013885

The cross-validation estimate of accuracy of the model is 82%, which suggests the model is performing very well. The delta AICc is 20.11, and evidence ratio of selected model over an intercept only model is 1242013885. The evidence ratio and delta AICc values indicated that precipitation last 3 days is a critical predictor variable. #### E.Coli-Poisson&Negative binomial distribution

#Poisson
mod_poisson <- glm(E.Coli ~ Temp,
                   family = poisson(link = "log"),
                   data = wq_rmout)
summary(mod_poisson)
## 
## Call:
## glm(formula = E.Coli ~ Temp, family = poisson(link = "log"), 
##     data = wq_rmout)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8632  -2.2265  -1.2107   0.8349   8.5877  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.938142   0.053281   17.61   <2e-16 ***
## Temp        0.140574   0.004475   31.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2582.2  on 216  degrees of freedom
## Residual deviance: 1632.4  on 215  degrees of freedom
## AIC: 2360.9
## 
## Number of Fisher Scoring iterations: 5
#Pseudo R^2 
R2<-100*(mod_poisson$null.deviance-mod_poisson$deviance)/mod_poisson$null.deviance
R2
## [1] 36.78047
#Overdispersion
mod_poisson$deviance/mod_poisson$df.residual
## [1] 7.592769
1-pchisq(mod_poisson$deviance,mod_poisson$df.residual)
## [1] 0
#Negative binomial
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:DAAG':
## 
##     hills
mod_nb <- glm.nb(E.Coli ~ Temp,
                        link = "log",
                        data = wq_rmout)

# Inspect the summary overview of the model.
summary(mod_nb)
## 
## Call:
## glm.nb(formula = E.Coli ~ Temp, data = wq_rmout, link = "log", 
##     init.theta = 1.130455706)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0676  -1.0955  -0.5030   0.2665   3.3167  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.03364    0.13826   7.476 7.66e-14 ***
## Temp         0.12943    0.01525   8.488  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.1305) family taken to be 1)
## 
##     Null deviance: 318.02  on 216  degrees of freedom
## Residual deviance: 227.65  on 215  degrees of freedom
## AIC: 1343.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.130 
##           Std. Err.:  0.120 
## 
##  2 x log-likelihood:  -1337.658
#plot new data and Poisson and negative binomial model
New_Data <- data.frame(Temp = seq(min(wq_rmout$Temp), max(wq_rmout$Temp), length.out = 200))
pred1 <- predict(mod_poisson, newdata = New_Data, type = "link", se = TRUE)
pred2 <- predict(mod_nb, newdata = New_Data, type = "link", se = TRUE)
#get the predictions on the correct scale
Fit_linked1 <- exp(pred1$fit)
Fit_linked2 <- exp(pred2$fit)
Fit_linked_max1 <- exp(pred1$fit + pred1$se.fit*1.96)
Fit_linked_min1 <- exp(pred1$fit - pred1$se.fit*1.96)
Fit_linked_max2 <- exp(pred2$fit + pred2$se.fit*1.96)
Fit_linked_min2 <- exp(pred2$fit - pred2$se.fit*1.96)
plot(E.Coli ~ Temp,
     data = wq_rmout,
     pch = 16,
     col = "#046C9A")
lines(Fit_linked1 ~ New_Data$Temp,
       col = "red")
lines(Fit_linked_max1 ~ New_Data$Temp,
       col = "grey70",
      lty = 3)
lines(Fit_linked_min1 ~ New_Data$Temp,
       col = "grey70",
      lty = 3)

plot(E.Coli ~ Temp,
     data = wq_rmout,
     pch = 16,
     col = "#046C9A")
lines(Fit_linked2 ~ New_Data$Temp,
       col = "green")
lines(Fit_linked_max2 ~ New_Data$Temp,
       col = "grey70",
      lty = 3)
lines(Fit_linked_min2 ~ New_Data$Temp,
       col = "grey70",
      lty = 3)

#Model selection
AIC(mod_poisson)
## [1] 2360.943
AIC(mod_nb)
## [1] 1343.658

In Poisson distribution generalized linear model, the \(R^2\) is 36.78, residual deviance / degrees of freedom = 7.59, and p-value for chi-square test is 0. The model is significantly overdispersed, and the Poisson assumption is not appropriate. The AICc of negative binomial model is 1343.658 and AICc of Poisson model is 2360.943. The best generalized linear model for modeling relationship between E.coli and temperature is negative binomial model. ### Non-linear model

#Polynomial
fit1<-gls(log.ecoli ~ Temp + I(Temp^2), data=wq_rmout)
summary(fit1)
## Generalized least squares fit by REML
##   Model: log.ecoli ~ Temp + I(Temp^2) 
##   Data: wq_rmout 
##        AIC      BIC    logLik
##   669.7677 683.2316 -330.8838
## 
## Coefficients:
##                 Value  Std.Error   t-value p-value
## (Intercept) 0.7205744 0.23190371 3.1072138  0.0021
## Temp        0.0115309 0.05865996 0.1965711  0.8443
## I(Temp^2)   0.0080538 0.00314372 2.5618685  0.0111
## 
##  Correlation: 
##           (Intr) Temp  
## Temp      -0.905       
## I(Temp^2)  0.790 -0.961
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -3.03131983 -0.83680633 -0.07727842  0.72483020  2.37104997 
## 
## Residual standard error: 1.07021 
## Degrees of freedom: 217 total; 214 residual
#Ricker
fit2<-nls(log.ecoli ~ a * Temp * exp(-b * Temp), start = list(a=1,b=1), data=wq_rmout)
summary(fit2)
## 
## Formula: log.ecoli ~ a * Temp * exp(-b * Temp)
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## a  0.176640   0.025056   7.050  2.4e-11 ***
## b -0.001803   0.010898  -0.165    0.869    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.092 on 215 degrees of freedom
## 
## Number of iterations to convergence: 8 
## Achieved convergence tolerance: 3.711e-07
#Holling type III
fit3<-nls(log.ecoli ~ a * I(Temp^2)/(b^2 + I(Temp^2)), start=list(a=1, b=1), data=wq_rmout)
summary(fit3)
## 
## Formula: log.ecoli ~ a * I(Temp^2)/(b^2 + I(Temp^2))
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a   4.2630     0.6223   6.851 7.60e-11 ***
## b  11.0369     1.6281   6.779 1.15e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.1 on 215 degrees of freedom
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 2.153e-06
#functions
poly<-function(x){
  coef(fit1)[1]+coef(fit1)[2]*x+coef(fit1)[3]*x^2
}
ricker<-function(x){
  coef(fit2)[1]*x*exp(-coef(fit2)[2]*x)
}
holling<-function(x){
  coef(fit3)[1]*x^2/((coef(fit3)[2])^2+x^2)
}
x<-seq(0,20,0.1)
y1<-poly(x)
y2<-ricker(x)
y3<-holling(x)
#plot
plot(log.ecoli~Temp, data=wq_rmout, pch = 16, col = "#046C9A")
lines(y1~x,col = "red")
lines(y2~x,col = "green")
lines(y3~x, col="blue")

AIC(fit1)
## [1] 669.7677
AIC(fit2)
## [1] 657.8243
AIC(fit3)
## [1] 661.2103
#Derivative
ricker = expression(0.17664*x*exp(-0.001803*x))
D(ricker,"x")
## 0.17664 * exp(-0.001803 * x) - 0.17664 * x * (exp(-0.001803 * 
##     x) * 0.001803)
rickerd<-function(x){
  0.17664 * exp(-0.001803 * x) - 0.17664 * x * (exp(-0.001803 * x) * 0.001803)
}
#maximum log.ecoli
root<-uniroot(rickerd, c(0,1000),tol=0.1)
root$root
## [1] 554.649
 #Temp=554.649

From the data points pattern, we fitted polynomial(quadratic), ratio(Holling type III) and exponential(Ricker)models. The AICc of quadratic model is 669.7677, AICc of Ricker function model is 657.8243, and AICc of Holling type III is 661.2103. As there is an optimum temperature for E.coli growth, the Ricker model that has a maximum value and reaches 0 when x increases to \(\infty\) should be considered. However, the maximum of \(f(x)\) is reached when temperature is 554.649, which is not consistent with the optimum temperature(\(37^\circ C\)). This may be caused by the missing data when temperature exceeds \(21^\circ C\). Ricker function model is the best non-linear model yet. We used D function for computing derivatives and uniroot() for computing maximum point.

Discussion

Water quality is of great importance to our health and well-being. It is essential to reduce turbidity and E.coli of source water for sequential water treatment. Weather conditions can influence water quality through precipitation and temperature. Predicting the rise of E.coli level and turbidity using weather data could help regulators take emergency measurements to control the pollution at an early stage.

In our study, we explored the effect of precipitation parameters and correlation between the variables via correlation matrix and pair plots. There is a significant linear relationship between E.coli level and temperature, E.coli level and precipitation in the last 7 days, and between turbidity and precipitation in the last 3 days. We first investigated the effects of precipitation and temperature on turbidity and log.ecoli. We built multiple linear regression models to fit the data and conducted model selection based on AICc values. The coefficients suggest that turbidity is negatively related to precipitation in the last 3 days and E.coli level is positively related to temperature. However, the residuals of the models are not normally distributed and show significant heteroscedasticity. Heteroscedasticity refers to unequal scatter. In order to detect heteroscedasticity, we plotted the residuals against temperature and precipitation and applied a variance structure to the model. However, this did not show any significant improvement. It indicates that linear regression models are not appropriate for the data.

As the data is collected over time, and the duration is around two weeks, data points that are close together in time can be related. We plotted the residuals over time and plotted the autocorrelation function. The ACFs of the data points are within the dotted lines(95% CIs) indicating the temporal autocorrelation is not significant.

Based on Guidelines for Canadian Drinking Water, we can transform our turbidity data to 0(failure)-1(success) data. The turbidity under 0.5NTU is marked as 1, otherwise 0. we plotted the transformed data and added a regression line. Although from the plot, turbidity is reversely proportional to precipitation, the model is hard to interpret. Then we used generalized linear model. The distribution is binomial and the link function is logistic function. The cross-validation estimate of accuracy of the model is 82%, which suggests the model is performing very well.

For E.coli data, we built generalized linear models and non-linear models. As the count of E.coli is discrete and ranges from 0 to \(\infty\), Poisson distribution and negative binomial distribution can be applied to the data. The Poisson distribution only has one parameter that stands for both mean and standard deviation, which can be a constraint. Negative binomial distribution is a flexible alternative as its variance \(\neq\) mean. In Poisson distribution generalized linear model, the \(R^2\) is 21.59, and the p-value for chi-square test is 0.059. The model is significantly overdispersed, denoting Poisson assumption is not appropriate. AICc of negative binomial model is 1343.658 and AICc of Poisson model is 2360.943. The best GLM model for modeling the relationship between log.ecoli and temperature is negative binomial generalized linear model. From the data points distribution pattern, non-linear models can also be applied. We built a second order polynomial model, ricker function model and Holling type III model. As the E.coli growth has an optimum temperature, above or below which the growth can be impaired. Ricker function which has a maximum value and reaches 0 when x increases to \(\infty\) should be the best fit for the data. AICc model selection also favors the Ricker function model. However, the optimum temperature based on the Ricker function model is \(555^\circ C\), which is a lot higher than the actual optimum temperature(\(37^\circ C\)). It may be caused by the missing data when the temperature exceeds \(21^\circ C\). Ricker function model is the best non-linear model for the existing data.

Reference

[1] Porubsky, S., Federico, G., Müthing, J., Jennemann, R., Gretz, N., Büttner, S., … & Betz, C. (2014). Direct acute tubular damage contributes to Shigatoxin‐mediated kidney failure. The Journal of pathology, 234(1), 120-133.
[2] British Columbia Ministry of Environment and Climate Change Strategy. 2020. B.C. Source Drinking Water Quality Guidelines: Guideline Summary. Water Quality Guideline Series, WQG-01. Prov. B.C., Victoria B.C.
[3] Marois-Fiset, J. T., Carabin, A., Lavoie, A., & Dorea, C. C. (2013). Effects of temperature and pH on reduction of bacteria in a point-of-use drinking water treatment product for emergency relief. Applied and environmental microbiology, 79(6), 2107-2109.
[4] Tornevi, A., Bergstedt, O., & Forsberg, B. (2014). Precipitation effects on microbial pollution in a river: lag structures and seasonal effect modification. PloS one, 9(5), e98546.
[5] Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13. [6] Zuur, A., Ieno, E. N., & Smith, G. M. (2007). Analyzing ecological data. Springer.
[7] Adler, P. B., & Levine, J. M. (2007). Contrasting relationships between precipitation and species richness in space and time. Oikos, 116(2), 221-232.
[8] Glover, S., & Dixon, P. (2004). Likelihood ratios: A simple and flexible statistic for empirical psychologists. Psychonomic Bulletin & Review, 11(5), 791-806.
[9] Aho, K., Derryberry, D., & Peterson, T. (2014). Model selection for ecologists: the worldviews of AIC and BIC. Ecology, 95(3), 631-636.
[10] Browne, M. W. (2000). Cross-validation methods. Journal of mathematical psychology, 44(1), 108-132.
[11] Goldberger, A. S. (1964). Econometric theory. Econometric theory.
[12] Toft, P., Malaiyandi, M., & Hickman, J. R. (1987). Guidelines for Canadian drinking water quality.
[13] Nelder, J. A., & Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3), 370-384.
[14] Hajna, A. A., & Perry, C. A. (1939). Optimum temperature for differentiation of Escherichia coli from other coliform bacteria. Journal of bacteriology, 38(3), 275.